feat(ingestion/hex): Allow category filters #16752

Open
alisa-aylward-toast wants to merge 12 commits into datahub-project:master from alisa-aylward-toast:feat(ingestion/hex)-allow-category-filters

Conversation

@alisa-aylward-toast
Contributor

@alisa-aylward-toast alisa-aylward-toast commented Mar 24, 2026

Allows filtering on categories in Hex. In the ingestion config, users can choose to include or exclude specific Hex categories.

  • The PR conforms to DataHub's Contributing Guideline (particularly PR Title Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Mar 24, 2026
@alisa-aylward-toast alisa-aylward-toast changed the title from "add tests" to "feat(ingestion/hex) Allow category filters" Mar 24, 2026
@codecov

codecov bot commented Mar 25, 2026

Bundle Report

Changes will increase total bundle size by 13.65kB (0.06%) ⬆️. This is within the configured threshold ✅

Detailed changes
| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 22.7MB | 13.65kB (0.06%) ⬆️ |

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| assets/index-*.js | 517 bytes | 12.45MB | 0.0% |
| assets/fabriclogo-*.svg (New) | 8.86kB | 8.86kB | 100.0% 🚀 |
| assets/fabricdatafactorylogo-*.svg (New) | 4.27kB | 4.27kB | 100.0% 🚀 |

@codecov

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@alisa-aylward-toast alisa-aylward-toast changed the title from "feat(ingestion/hex) Allow category filters" to "feat(ingestion/hex): Allow category filters" Mar 25, 2026
@alisa-aylward-toast alisa-aylward-toast marked this pull request as ready for review March 25, 2026 16:47
@github-actions
Contributor

Linear: ING-2067

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.

@maggiehays maggiehays added the needs-review Label for PRs that need review from a maintainer. label Mar 25, 2026
@github-actions
Contributor

Your PR has been assigned to @dineshrathi-dh (dinesh.rathi) for review (ING-2067).

@github-actions github-actions bot requested a review from dineshrathi-dh March 26, 2026 00:52
if not self.source_config.category_pattern.allowed(
    category.name
):
    skip_item = True
Contributor

@alokr-dhub alokr-dhub Mar 26, 2026

From the code, it seems the requirement is to skip the project_or_component entirely if any of its categories does not match the allowed pattern. Please add more detail to the description explaining the expected behavior. Even taking that behavior as given, the code looks more complex than needed. It can be simplified as:

if project_or_component.categories and any(
    not self.source_config.category_pattern.allowed(c.name)
    for c in project_or_component.categories               
):                                          
    continue  

Contributor Author

Question on how patterns work: if no pattern is configured for this in the ingestion, will self.source_config.category_pattern.allowed always return True?

Contributor

Yes. The default is set to allow_all(), so .allowed always returns True unless an allow or deny list is specified.
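
For anyone following along, a rough illustration of that default behavior. `SimplePattern` is a hypothetical stand-in written for this comment, not DataHub's actual pattern class; it assumes regexes are matched from the start of the name and that a deny match takes precedence over an allow match.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimplePattern:
    """Hypothetical stand-in for an allow/deny regex filter (not DataHub's class)."""

    # Assumed defaults: allow everything, deny nothing.
    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, name: str) -> bool:
        # Assumption: a deny match always wins over an allow match.
        if any(re.match(p, name) for p in self.deny):
            return False
        return any(re.match(p, name) for p in self.allow)


default_pattern = SimplePattern()
scratch_denied = SimplePattern(deny=["Scratch.*"])

print(default_pattern.allowed("Keep_Scratchpad"))  # True: nothing configured
print(scratch_denied.allowed("Scratchpad"))        # False: matches the deny regex
print(scratch_denied.allowed("Keep_Scratchpad"))   # True: regex is anchored at the start
```

With no pattern configured, every category name passes, so the new check would not skip anything for existing recipes under these assumptions.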

Contributor

Also, I would suggest moving the filtering logic to a separate method; that would be cleaner.
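
One possible shape for such a method, as a sketch only. `Category`, `Project`, `HexSourceSketch`, `_should_skip`, and `filter_items` are hypothetical names made up for illustration, and the `category_allowed` callable stands in for `self.source_config.category_pattern.allowed`; none of this is the actual Hex source code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Category:
    name: str


@dataclass
class Project:
    title: str
    categories: List[Category] = field(default_factory=list)


class HexSourceSketch:
    """Hypothetical, trimmed-down source; only the filtering path is modelled."""

    def __init__(self, category_allowed: Callable[[str], bool]) -> None:
        # Stand-in for self.source_config.category_pattern.allowed
        self._category_allowed = category_allowed

    def _should_skip(self, item: Project) -> bool:
        # Skip the item if ANY of its categories is rejected by the pattern.
        # Items with no categories are never skipped by this check.
        return any(
            not self._category_allowed(c.name) for c in (item.categories or [])
        )

    def filter_items(self, items: List[Project]) -> List[Project]:
        return [item for item in items if not self._should_skip(item)]


source = HexSourceSketch(category_allowed=lambda name: name != "Scratchpad")
kept = source.filter_items(
    [
        Project("uncategorized"),
        Project("shared", [Category("Production")]),
        Project("mixed", [Category("Production"), Category("Scratchpad")]),
    ]
)
print([p.title for p in kept])  # ['uncategorized', 'shared']
```

Keeping the predicate in one private method would also give other skip checks a single place to live.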

"categories": [{
    "name": "Keep_Scratchpad",
    "description": "Intended for broad consumption"
}],
Contributor

Please add more test cases with a mix of allowed and denied categories for more rigorous testing.
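
As a rough idea of the coverage meant here, a table-driven sketch. The `DENY` list, `category_allowed`, and `should_skip` are hypothetical helpers written for this comment, assuming the "skip if any category is rejected" rule; the real tests would go through the Hex source and its fixtures instead.

```python
import re
from typing import List, Tuple

DENY = [r"Scratchpad"]  # hypothetical deny regexes used only in this sketch


def category_allowed(name: str) -> bool:
    # Stand-in predicate: allow everything except names matching a deny regex.
    return not any(re.match(p, name) for p in DENY)


def should_skip(category_names: List[str]) -> bool:
    # Assumed rule: one rejected category is enough to skip the item.
    return any(not category_allowed(n) for n in category_names)


# (category names on the item, expected "skip" result)
CASES: List[Tuple[List[str], bool]] = [
    ([], False),                            # no categories
    (["Production"], False),                # only allowed
    (["Scratchpad"], True),                 # only denied
    (["Production", "Scratchpad"], True),   # mixed, denied last
    (["Scratchpad", "Production"], True),   # mixed, denied first
    (["Keep_Scratchpad"], False),           # similar name, regex anchored at start
]

for names, expected in CASES:
    assert should_skip(names) == expected, names
print("all cases passed")
```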

@maggiehays maggiehays removed the needs-review Label for PRs that need review from a maintainer. label Mar 26, 2026
@maggiehays maggiehays added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Mar 26, 2026
def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
    with self.report.new_stage("Fetch Hex assets from Hex API"):
        for project_or_component in self.hex_api.fetch_projects():
            if project_or_component.categories and any(
Contributor Author

@alokr-dhub where would you suggest putting this logic if it moves into a new function? I ask because having it here matches the pattern of lines 270 and 278, so I didn't find a natural place to put the function.
I could add it to the HexAPI class if you think that's a good spot for it.

Contributor

OK, we can keep it here for now. Ideally, all the checks should be in a separate internal method in the Hex source class.

@alokr-dhub
Contributor

@alisa-aylward-toast LGTM. Please add the test cases.

@alisa-aylward-toast
Contributor Author

@alokr-dhub since your first check I have added two more test cases. I added one with two categories, one of which is excluded: https://github.com/datahub-project/datahub/pull/16752/changes#diff-5c43981e8e278bbb4aec0157f6dc7bce9da1425d7d6e7485ca807a223a499035R101
and I added tests in alisa-aylward-toast@b72b295

@maggiehays maggiehays added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 27, 2026
Labels

community-contribution: PR or Issue raised by member(s) of DataHub Community
ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer.

3 participants